Cygwin cleanup

  • Jump to comment-1
    thomas.munro@gmail.com2022-07-26T04:24:25+00:00
    Hi, Continuing a discussion started over at [1]. Moving this to a new thread so that other thread can focus on Unix cleanup, and both threads can get CI coverage... 1. In a few places, it is alleged that both __CYGWIN__ and WIN32 might be defined at the same time. Do you think we should try to get rid of that possibility? I understand that we have to have a few tests for __CYGWIN__ here and there, because eg file permissions don't work quite right and there's not much we can do about that. But it seems a bit unhelpful if we also have to worry about a more-or-less POSIX-ish build taking WIN32 paths at uncertain times if we forget to defend against that, or wonder why some places are not consistent. A quick recap of the three flavours of Windows platform we have to handle, as I understand it: * MSVC: Windowsy toolchain, Windowsy C * custom perl scripts instead of configure * msbuild instead of make * MSVC compiler * Windows C APIs * we provide our own emulation of some POSIX C APIs on top * MSYS: Unixy toolchain, Windowsy C * configure (portname = "win32") * make * GCC compiler * Windows C APIs * we provide our own emulation of some POSIX C APIs on top * Cygwin: Unixy toolchain, Unixy C * configure (portname = "cygwin") * make * GCC compiler * POSIX C APIs (emulations provided by the Cygwin runtime libraries) (The configure/make part will be harmonised by the Meson project.) The macro WIN32 is visibly defined by something in/near msbuild in MSVC builds: /D WIN32 is right here in the build transcripts (whereas the compiler defines _WIN32; good compiler). I am not sure how exactly it is first defined in MSYS builds; I suspect that MSYS gcc might define it itself, but I don't have access to MSYS to check. As for Cygwin, the only translation unit where I could find both __CYGWIN__ and WIN32 defined is dirmod.c, and that results from including <windows.h> and ultimately <minwindef.h> (even though WIN32 isn't defined yet at that time). I couldn't understand why we do that, but I probably didn't read enough commit history. The purpose of dirmod.c on Cygwin today is only to wrap otherwise pure POSIX code in retry loops to handle those spurious EACCES errors due to NT sharing violations, so there is no need for that. Proposal: let's make it a programming rule that we don't allow definitions from Windows headers to leak into Cygwin translation units, preferably by never including them, or if we really must, let's grant specific exemptions in an isolated and well documented way. We don't seem to need any such exemptions currently. Places where we currently worry about the contradictory macros could become conditional #error directives instead. 2. To make it possible to test any of that, you either need a working Windows+Cygwin setup, or working CI. I'm a salty old Unix hacker so I opted for the latter, and I also hope this will eventually be useful to others. Unfortunately I haven't figured out how to get everything working yet, so some of the check-world tests are failing. Clues most welcome! The version I'm posting here is set to run always, so that cfbot will show it alongside others. But I would imagine that if we got a committable-quality version of this, it'd probably be opt-in, so you'd have to say "ci-os-only: cygwin", or "ci-os-only: cygwin, windows" etc in a commit to your private github account to ask for it (or maybe we'll come up with a way to tell cfbot we want the full works of CI checks; the same decision will come up for MSYS, OpenBSD and NetBSD CI support that my colleague is working on). There are other things to fix too, including abysmal performance; see commit message. 3. You can't really run PostgreSQL on Cygwin for real, because its implementation of signals does not have reliable signal masking, so unsubtle and probably also subtle breakage occurs. That was reported upstream by Noah years ago, but they aren't working on a fix. lorikeet shows random failures, and presumably any CI system will do the same... I even wondered about putting our own magic entry/exit macros into signal handlers, that would use atomics to implement a second level of signal masking (?!) but that's an uncommonly large bandaid for a defective platform... and trying to fix Cygwin itself is a rabbithole too far for me. 4. When building with Cygwin GCC 11.3 you get a bunch of warnings that don't show up on other platforms, seemingly indicating that it interprets -Wimplicit-fallthrough=3 differently. Huh? [1] https://www.postgresql.org/message-id/CA%2BhUKGKZ_FjkBnjGADk%2Bpa2g4oKDcG8%3DSE5V23sPTP0EELfyzQ%40mail.gmail.com
    • Jump to comment-1
      pryzby@telsasoft.com2022-07-27T06:44:25+00:00
      On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote: > 3. You can't really run PostgreSQL on Cygwin for real, because its > implementation of signals does not have reliable signal masking, so > unsubtle and probably also subtle breakage occurs. That was reported > upstream by Noah years ago, but they aren't working on a fix. > lorikeet shows random failures, and presumably any CI system will do > the same... Reference: https://www.postgresql.org/message-id/20170321034703.GB2097809%40tornado.leadboat.com On my 2nd try: https://cirrus-ci.com/task/5311911574110208 TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370) 2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG: background worker "parallel worker" (PID 16370) was terminated by signal 6: Aborted > XXX Doesn't get all the way through yet... Mainly because getopt was causing all tap tests to fail. I tried to fix that in configure, but ended up changing the callers. This is getting close, but I don't think has actually managed to pass all tests yet.. https://cirrus-ci.com/task/5274721116749824 > 4. When building with Cygwin GCC 11.3 you get a bunch of warnings > that don't show up on other platforms, seemingly indicating that it > interprets -Wimplicit-fallthrough=3 differently. Huh? Evidently due to the same getopt issues. > XXX This should use a canned Docker image with all the right packages > installed Has anyone tried using non-canned images ? It sounds like this could reduce the 4min startup time for windows. https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment > XXX configure is soooo slooow, can we cache it?! Compiling is also > insanely slow, but ccache gets it down to a couple of minutes if you're > lucky One reason compiling was slow is because you ended up with -O2. You can cache configure as long as you're willing to re-run it whenever options were changed. That also applies to the existing headerscheck. > XXX I don't know how to put variables like BUILD_JOBS into the scripts WDYM ? If it's outside of bash and in windows shell it's like %var%, right ? https://cirrus-ci.org/guide/writing-tasks/#environment-variables I just noticed that cirrus is misbehaving: if there's a variable called CI (which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}. I've also seen weirdness when variable names or operators appear in the commit message... > XXX Needs some --with-X options Done > XXX We would never want this to run by default in CI, but it'd be nice > to be able to ask for it with ci-os-only! (See commented out line) > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' Doesn't this already do what's needed? As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only', the task will runs only on request. > XXX I have no idea if crash dump works, and if this should share > elements with the msys work in commitfest #3575 Based on the crash above, it wasn't working. And after some changes ... it still doesn't work. windows_os is probably skipping too many things. -- Justin
      • Jump to comment-1
        thomas.munro@gmail.com2022-07-28T22:04:04+00:00
        On Wed, Jul 27, 2022 at 6:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Tue, Jul 26, 2022 at 04:24:25PM +1200, Thomas Munro wrote: > > 3. You can't really run PostgreSQL on Cygwin for real, because its > > implementation of signals does not have reliable signal masking, so > > unsubtle and probably also subtle breakage occurs. That was reported > > upstream by Noah years ago, but they aren't working on a fix. > > lorikeet shows random failures, and presumably any CI system will do > > the same... > > Reference: https://www.postgresql.org/message-id/20170321034703.GB2097809%40tornado.leadboat.com > > On my 2nd try: > > https://cirrus-ci.com/task/5311911574110208 > TRAP: FailedAssertion("mq->mq_sender == NULL", File: "shm_mq.c", Line: 230, PID: 16370) > 2022-07-26 06:32:35.525 PDT [15538][postmaster] LOG: background worker "parallel worker" (PID 16370) was terminated by signal 6: Aborted Thanks for working on this! Huh, that Cygwin being shipped by Choco is quite old, older than lorikeet's, but not old enough to not have the bug: [04:33:55.234] Starting cygwin install, version 2.918 Based on clues in Noah's emails in the archives, I think versions from maybe somewhere around 2015 didn't have the bug, and then the bug appeared, and AFAIK it's still here. I wonder if you can tell Choco to install an ancient version, but even if that's possible you'd be dealing with other stupid problems and bugs. > > XXX Doesn't get all the way through yet... > > Mainly because getopt was causing all tap tests to fail. > I tried to fix that in configure, but ended up changing the callers. > > This is getting close, but I don't think has actually managed to pass all tests > yet.. https://cirrus-ci.com/task/5274721116749824 Woo. > > 4. When building with Cygwin GCC 11.3 you get a bunch of warnings > > that don't show up on other platforms, seemingly indicating that it > > interprets -Wimplicit-fallthrough=3 differently. Huh? > > Evidently due to the same getopt issues. Ahh, nice detective work. > > XXX This should use a canned Docker image with all the right packages > > installed > > Has anyone tried using non-canned images ? It sounds like this could reduce > the 4min startup time for windows. > > https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment Yeah, I had that working once. Not sure what the pros and cons would be for us. > > XXX configure is soooo slooow, can we cache it?! Compiling is also > > insanely slow, but ccache gets it down to a couple of minutes if you're > > lucky > > One reason compiling was slow is because you ended up with -O2. Ah, right. > You can cache configure as long as you're willing to re-run it whenever options > were changed. That also applies to the existing headerscheck. > > > XXX I don't know how to put variables like BUILD_JOBS into the scripts > > WDYM ? If it's outside of bash and in windows shell it's like %var%, right ? > https://cirrus-ci.org/guide/writing-tasks/#environment-variables Right. I should have taken the clue from the %cd% (I got a few ideas about how to do this from libarchive's CI scripting[1]). > I just noticed that cirrus is misbehaving: if there's a variable called CI > (which there is), then it expands $CI_FOO like ${CI}_FOO rather than ${CI_FOO}. > I've also seen weirdness when variable names or operators appear in the commit > message... > > > XXX Needs some --with-X options > > Done Neat. > > XXX We would never want this to run by default in CI, but it'd be nice > > to be able to ask for it with ci-os-only! (See commented out line) > > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' > > Doesn't this already do what's needed? > As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only', > the task will runs only on request. Yeah I was just trying to say that I was sharing the script in a way that always runs, but for commit we'd want that. This is all far too slow for cfbot to have to deal with on every build. Looks like we can expect to be able to build and test fast on Windows soonish, though, so maybe one day we'd just turn Cygwin and MSYS on? [1] https://github.com/libarchive/libarchive/blob/master/build/ci/cirrus_ci/ci.cmd
        • Jump to comment-1
          pryzby@telsasoft.com2022-08-04T03:38:28+00:00
          On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote: > > > XXX We would never want this to run by default in CI, but it'd be nice > > > to be able to ask for it with ci-os-only! (See commented out line) > > > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' > > > > Doesn't this already do what's needed? > > As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only', > > the task will runs only on request. > > Yeah I was just trying to say that I was sharing the script in a way > that always runs, but for commit we'd want that. That makes more sense after noticing that you created a cf entry (for which cfbot has been skipping my patch due to my "only_if" line). There's still a few persistent issues: This fails ~50% of the time in recovery 010-truncate I hacked around this by setting data_sync_retry. https://cirrus-ci.com/task/5289444063313920 I found these, not sure if they're relevant. https://www.postgresql.org/message-id/flat/CAA4eK1Kft05mwNuZbTVRmz8SNS3r%2BuriuCT8DxL5KJy5btoS-A%40mail.gmail.com https://www.postgresql.org/message-id/flat/CAFiTN-uGxgo5258hZy2QJoz%3Ds7_Cs7v9%3Db8Z2GgFV7qmQUOwxw%40mail.gmail.com And an fsync abort in 013 which seems similar to this other one. data_sync_retry also avoids this issue. https://cirrus-ci.com/task/6283023745286144?logs=cores#L34 https://www.postgresql.org/message-id/flat/CAMVYW_4QhjZ-19Xpr2x1B19soRCNu1BXHM8g1mOnAVtd5VViDw%40mail.gmail.com And sometimes various assertions failing in regress parallel_select (and then times out) https://api.cirrus-ci.com/v1/artifact/task/5537540282253312/log/src/test/regress/log/postmaster.log https://api.cirrus-ci.com/v1/artifact/task/6108746773430272/log/src/test/regress/log/postmaster.log Or "could not map dynamic shared memory segment" (actually in 027-stream-regress): https://cirrus-ci.com/task/6168860746317824 And segfault in vacuum parallel https://api.cirrus-ci.com/v1/artifact/task/5404589569605632/log/src/test/regress/log/postmaster.log Sometimes semctl() failed: Resource temporarily unavailable https://api.cirrus-ci.com/v1/artifact/task/5027860623654912/log/src/test/subscription/tmp_check/log/014_binary_publisher.log https://api.cirrus-ci.com/v1/artifact/task/5027860623654912/log/src/bin/pg_rewind/tmp_check/log/001_basic_standby_local.log Some more https://cirrus-ci.com/task/6468927780814848 If you're lucky, there's only 1 or 2 problems, of which those are different symptoms.. Maybe for now this needs to disable tap tests :( This shows that it *can* pass, if slowly, and infrequently: https://cirrus-ci.com/task/6546858536337408 This fixes my changes to configure for getopt. And simplifies the changes to *.pl (the .exe changes weren't necessary at all). And removes the changes for implicit-fallthrough; I realized that configure was just deciding that it didn't work and not using it at all. And adds support for backtraces. And remove kerberos and and add libxml Why did you write "|| exit /b 1" in all the bash invocations ? I think cirrus handles that automatically. -- Justin
          • Jump to comment-1
            thomas.munro@gmail.com2022-08-04T04:16:06+00:00
            On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > [train wreck] Oh my, so I'm getting the impression we might actually be totally unstable on Cygwin. Which surprises me because ... wait a minute ... lorikeet isn't even running most of the tests. So... we don't really know the degree to which any of this works at all? > This shows that it *can* pass, if slowly, and infrequently: > https://cirrus-ci.com/task/6546858536337408 Ok, that's slightly reassuring, so maybe we *can* fix this, but I'm one step closer to what Tom said, re wasting developer time... > [lots of improvements] Cool. > Why did you write "|| exit /b 1" in all the bash invocations ? I think cirrus > handles that automatically. Cargo-culted from libarchive.
            • Jump to comment-1
              thomas.munro@gmail.com2022-08-04T05:19:47+00:00
              On Thu, Aug 4, 2022 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Aug 4, 2022 at 3:38 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > [train wreck] > > Oh my, so I'm getting the impression we might actually be totally > unstable on Cygwin. Which surprises me because ... wait a minute ... > lorikeet isn't even running most of the tests. So... we don't really > know the degree to which any of this works at all? Hmm, it's possible that all these failures are just new-to-me effects of the known bug. Certainly the assertion failures are of the usual type, and I think it might be possible for the weird parallel query failure to be explained by the postmaster forking extra phantom child processes. It may be madness to try to work around this, but I wonder if we could use a static local variable that we update with atomic compare exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system. On entry, if you can do 0->1 it means you are allowed to run the function. If it's non-zero, set n->n+1 and return immediately: signal blocked, but queued for later. On exit, you CAS n->0. If n was > 1, then you have to jump back to the top and run the function body again.
              • Jump to comment-1
                tgl@sss.pgh.pa.us2022-08-04T05:23:09+00:00
                Thomas Munro <thomas.munro@gmail.com> writes: > It may be madness to try to work around this, but I wonder if we could > use a static local variable that we update with atomic compare > exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and > PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system. > On entry, if you can do 0->1 it means you are allowed to run the > function. If it's non-zero, set n->n+1 and return immediately: signal > blocked, but queued for later. On exit, you CAS n->0. If n was > 1, > then you have to jump back to the top and run the function body again. And ... we're expending all this effort for what exactly? regards, tom lane
                • Jump to comment-1
                  thomas.munro@gmail.com2022-08-04T06:07:35+00:00
                  On Thu, Aug 4, 2022 at 5:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > It may be madness to try to work around this, but I wonder if we could > > use a static local variable that we update with atomic compare > > exhange, inside PG_SIGNAL_HANDLER_ENTRY(), and > > PG_SIGNAL_HANDLER_EXIT() macros that do nothing on every other system. > > On entry, if you can do 0->1 it means you are allowed to run the > > function. If it's non-zero, set n->n+1 and return immediately: signal > > blocked, but queued for later. On exit, you CAS n->0. If n was > 1, > > then you have to jump back to the top and run the function body again. > > And ... we're expending all this effort for what exactly? I'd be almost as happy if we ripped it all out, shut down lorikeet and added it to the list of fallen platforms. I'd feel a bit like a vandal, though. My suggestion is a last-ditch idea for Noah (CCd) and/or Andrew to consider, who (respectively) blocked this last time and run lorikeet. No plans to write that patch myself...
            • Jump to comment-1
              andres@anarazel.de2022-08-04T04:22:55+00:00
              Hi, On 2022-08-04 16:16:06 +1200, Thomas Munro wrote: > Ok, that's slightly reassuring, so maybe we *can* fix this, but I'm > one step closer to what Tom said, re wasting developer time... It might be worth checking whether the cygwin installer, which at some point at least allowed installing postgres, has download numbers available anywhere. It's possible we could e.g. get away with just allowing libpq to be built. Greetings, Andres Freund
        • Jump to comment-1
          pryzby@telsasoft.com2022-07-28T22:23:19+00:00
          On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote: > Thanks for working on this! > > Huh, that Cygwin being shipped by Choco is quite old, older than > lorikeet's, but not old enough to not have the bug: > > [04:33:55.234] Starting cygwin install, version 2.918 Hm, I think that's the version of "cygwinsetup" but not cygwin.. It also says this: [13:16:36.014] Cygwin v3.3.4.20220408 [Approved] > I wonder if you can tell Choco > to install an ancient version, but even if that's possible you'd be > dealing with other stupid problems and bugs. Yes: choco install -y --no-progress --version 4.6.1 ccache > > > XXX This should use a canned Docker image with all the right packages > > > installed > > > > Has anyone tried using non-canned images ? It sounds like this could reduce > > the 4min startup time for windows. > > > > https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment > > Yeah, I had that working once. Not sure what the pros and cons would be for us. I think it could be a lot faster to start, since cirrus caches the generated docker image locally. Rather than (I gather) pulling the image every time. > > > XXX We would never want this to run by default in CI, but it'd be nice > > > to be able to ask for it with ci-os-only! (See commented out line) > > > only_if: $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*cygwin.*' > > > > Doesn't this already do what's needed? > > As long as it doesn't also check: CHANGE_MESSAGE !~ 'ci-os-only', > > the task will runs only on request. > > Yeah I was just trying to say that I was sharing the script in a way > that always runs, but for commit we'd want that. This is all far too > slow for cfbot to have to deal with on every build. It occurred to me today that if cfbot preserved the original patch series, and commit messages, that would allow patch authors to write things like "ci-os-only: docs" for a doc only patch. I've never gotten cirrus' changesOnly() stuff to work... > Looks like we can expect to be able to build and test fast on Windows > soonish, though, Do you mean with meson ? > so maybe one day we'd just turn Cygwin and MSYS on? I didn't understand this ? -- Justin
          • Jump to comment-1
            andres@anarazel.de2022-08-04T04:06:05+00:00
            Hi, On 2022-07-28 17:23:19 -0500, Justin Pryzby wrote: > On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote: > > > > XXX This should use a canned Docker image with all the right packages > > > > installed > > > > > > Has anyone tried using non-canned images ? It sounds like this could reduce > > > the 4min startup time for windows. > > > > > > https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment > > > > Yeah, I had that working once. Not sure what the pros and cons would be for us. > > I think it could be a lot faster to start, since cirrus caches the generated > docker image locally. Rather than (I gather) pulling the image every time. I'm quite certain that is not true. All the docker images built are just uploaded to the google container registry and then downloaded onto a *separate* windows host. The dockerfile: stuff generates a separate task running on a separate machine... It's a bit better for non-windows containers, because there google has some optimization for pulling image (pieces) on demand or such. Greetings, Andres Freund
          • Jump to comment-1
            thomas.munro@gmail.com2022-07-28T22:29:41+00:00
            On Fri, Jul 29, 2022 at 10:23 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Fri, Jul 29, 2022 at 10:04:04AM +1200, Thomas Munro wrote: > > [04:33:55.234] Starting cygwin install, version 2.918 > > Hm, I think that's the version of "cygwinsetup" but not cygwin.. > It also says this: [13:16:36.014] Cygwin v3.3.4.20220408 [Approved] Oops. Ok so we're testing the very latest then, and it definitely still has the bug as we thought. > It occurred to me today that if cfbot preserved the original patch series, and > commit messages, that would allow patch authors to write things like > "ci-os-only: docs" for a doc only patch. I've never gotten cirrus' > changesOnly() stuff to work... Maybe it's time to switch to "git am -3 ..." and reject patches that don't apply that way. > > Looks like we can expect to be able to build and test fast on Windows > > soonish, though, > > Do you mean with meson ? Yeah. Also there are some other things we can do to speed up testing on Windows (and elsewhere), like not running every test query with new psql + backend process pair, which takes at least a few hundred ms and sometimes up to several seconds on this platform; I have some patches I need to finish... > > so maybe one day we'd just turn Cygwin and MSYS on? > > I didn't understand this ? I mean, if, some sunny day, we can compile and test on Windows at non-glacial speeds, then it would become possible to contemplate having cfbot run these tasks for every patch every time.
    • Jump to comment-1
      tgl@sss.pgh.pa.us2022-07-26T04:34:18+00:00
      Thomas Munro <thomas.munro@gmail.com> writes: > 3. You can't really run PostgreSQL on Cygwin for real, because its > implementation of signals does not have reliable signal masking, so > unsubtle and probably also subtle breakage occurs. That was reported > upstream by Noah years ago, but they aren't working on a fix. > lorikeet shows random failures, and presumably any CI system will do > the same... If that's an accurate statement, shouldn't we just drop Cygwin support? Now that we have a native Windows build, it's hard to see how any live user would prefer to use the Cygwin build. regards, tom lane
      • Jump to comment-1
        thomas.munro@gmail.com2022-07-26T05:16:48+00:00
        On Tue, Jul 26, 2022 at 4:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > 3. You can't really run PostgreSQL on Cygwin for real, because its > > implementation of signals does not have reliable signal masking, so > > unsubtle and probably also subtle breakage occurs. That was reported > > upstream by Noah years ago, but they aren't working on a fix. > > lorikeet shows random failures, and presumably any CI system will do > > the same... > > If that's an accurate statement, shouldn't we just drop Cygwin support? This thread rejected the idea last time around: https://www.postgresql.org/message-id/flat/136712b0-0619-5619-4634-0f0286acaef7%402ndQuadrant.com lorikeet still shows the issue. Failures often involve assertions about PMSignalState or mq->mq_sender. Hmm, it's running Cygwin 3.2.0 (March 2021) and the latest release is 3.3.5, so it's remotely possible that it's been fixed recently. Maybe that'd be somewhere in here, but it's not jumping out: https://github.com/cygwin/cygwin/commits/master/winsup/cygwin/signal.cc (Oooh, another implementation of signalfd...)
        • Jump to comment-1
          tgl@sss.pgh.pa.us2022-07-26T11:40:47+00:00
          Thomas Munro <thomas.munro@gmail.com> writes: > On Tue, Jul 26, 2022 at 4:34 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If that's an accurate statement, shouldn't we just drop Cygwin support? > This thread rejected the idea last time around: > https://www.postgresql.org/message-id/flat/136712b0-0619-5619-4634-0f0286acaef7%402ndQuadrant.com I think maybe we should re-open the discussion. I've certainly reached the stage of fed-up-ness. That platform seems seriously broken, upstream is making no progress on fixing it, and there doesn't seem to be any real-world use-case. The only positive argument for it is that Readline doesn't work in the other Windows builds --- but we've apparently not rechecked that statement in eighteen years, so maybe things are better now. If we could just continue to blithely ignore lorikeet's failures, I wouldn't mind so much; but doing any significant amount of new code development work for the platform seems like throwing away developer time. regards, tom lane
          • Jump to comment-1
            robertmhaas@gmail.com2022-07-26T17:08:56+00:00
            On Tue, Jul 26, 2022 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think maybe we should re-open the discussion. I've certainly > reached the stage of fed-up-ness. That platform seems seriously > broken, upstream is making no progress on fixing it, and there > doesn't seem to be any real-world use-case. The only positive > argument for it is that Readline doesn't work in the other > Windows builds --- but we've apparently not rechecked that > statement in eighteen years, so maybe things are better now. > > If we could just continue to blithely ignore lorikeet's failures, > I wouldn't mind so much; but doing any significant amount of new > code development work for the platform seems like throwing away > developer time. I agree with that. All things being equal, I like the idea of supporting a bunch of different platforms, and Cygwin doesn't really look that dead. It has recent releases. But if blocking signals doesn't actually work on that platform, making PostgreSQL work reliably there seems really difficult. -- Robert Haas EDB: http://www.enterprisedb.com
            • Jump to comment-1
              thomas.munro@gmail.com2022-07-28T22:57:52+00:00
              On Wed, Jul 27, 2022 at 5:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jul 26, 2022 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I think maybe we should re-open the discussion. I've certainly > > reached the stage of fed-up-ness. That platform seems seriously > > broken, upstream is making no progress on fixing it, and there > > doesn't seem to be any real-world use-case. The only positive > > argument for it is that Readline doesn't work in the other > > Windows builds --- but we've apparently not rechecked that > > statement in eighteen years, so maybe things are better now. > > > > If we could just continue to blithely ignore lorikeet's failures, > > I wouldn't mind so much; but doing any significant amount of new > > code development work for the platform seems like throwing away > > developer time. > > I agree with that. All things being equal, I like the idea of > supporting a bunch of different platforms, and Cygwin doesn't really > look that dead. It has recent releases. But if blocking signals > doesn't actually work on that platform, making PostgreSQL work > reliably there seems really difficult. It's one thing to drop old dead Unixes but I don't think anyone would enjoy dropping support for an active open source project. The best outcome would be for people who have an interest in seeing PostgreSQL work correctly on Cygwin to help get the bug fixed. Here are the threads I'm aware of: https://cygwin.com/pipermail/cygwin/2017-August/234001.html https://cygwin.com/pipermail/cygwin/2017-August/234097.html I wonder if these problems would go away as a nice incidental side-effect if we used latches for postmaster wakeups. I don't know... maybe, if the problem is just with the postmaster's pattern of blocking/unblocking? Maybe backend startup is simple enough that it doesn't hit the bug? From a quick glance, I think the assertion failures that occur in regular backends can possibly be blamed on the postmaster getting confused about its children due to unexpected handler re-entry.